Summary

We present a parsing algorithm with polynomial time complexity for a large subset of V-TAG languages. V-TAG, a 
variant of multi-component TAG, can handle free-word order phenomena which are beyond the class LCFRS (which 
includes regular TAG). Our algorithm is based on a CYK-style parser for TAGs. 



1 Introduction 



Long-Distance Scrambling is a word-order phenomenon which is "doubly unbounded" in that (i) more than one element
can move, and (ii) movement can be unbounded. In (Becker et al., 1991), we argue, assuming that elementary trees
express a complete predicate-argument structure, that scrambling is beyond TAG. In (Becker et al., 1992), we show
that no formalism in the class LCFRS (which includes TAG) can derive scrambling. (Becker et al., 1991) proposes two
variants of the TAG formalism which can derive scrambling while still preserving most of the desirable properties of
TAGs (i.e., an extended domain of locality and the factoring of recursion). However, little is known about the formal
and computational properties of those systems. (Rambow, 1994) proposes V-TAG, which is closely related to one of the
previously proposed variants, but redefines the derivation relation.

V-TAG can derive the relevant set of sentences and also cases where scrambling co-occurs with long-distance 
topicalization (a separate linguistic phenomenon also found in English, in which a single element moves into sentence- 
initial position): 



(1) [Dieses Buch]_i hat [den Kindern]_j bisher noch niemand [PRO t_j t_i zu geben] versucht.

[this book]ACC has [the children]DAT so far yet [no-one]NOM to give tried

So far, no-one has tried to give this book to the children. 

We refer to (Müller and Sternefeld, 1993) for a more extensive discussion of the freedom of scrambling in German,
Japanese, and Russian. In this paper, we give a parsing algorithm with polynomial time-complexity for lexicalized 
V-TAG languages. 

2 V-TAG 

Multi-Component TAG (MC-TAG, see (Weir, 1988) for a broader discussion) extends the elementary structures of the 
grammar from trees to sets¹ of trees. The formal and computational properties of MC-TAG depend on the exact definition
of adjunction. "Tree-local" and "set-local" MC-TAG, in which the adjunction sites are restricted, are polynomially 
parseable, but since they are included in LCFRS, they are not adequate for deriving scrambling (see Section 1). (Weir, 
1988) also defines "non-local MC-TAG", in which trees from one set must be adjoined simultaneously anywhere into a 
derived tree. As shown in (Becker et al., 1991), non-local MC-TAG can handle scrambling. Unfortunately, it is known 
to generate NP-complete languages (Rambow and Satta, 1992). 



¹ If a set includes two trees with identical labeling, we assume that the node-addresses are different.



In V-TAG, introduced in (Rambow, 1994), there are no restrictions on adjunction sites. Trees from one tree set can 
be adjoined anywhere in the derived tree, and they need not be adjoined simultaneously or in a fixed order. Furthermore,
trees in the tree sets are equipped with dominance links, first formally defined in (Becker et al., 1991), which have been
used previously in linguistic work (for example by (Kroch, 1989)). A dominance link can relate the foot node of a tree to 
any node in any tree of the same set. The dominance links provide a constraint on possible derivations: after a derivation 
is completed, each dominance link must hold in the derived tree. Dominance links are essential for encoding structural 
relations (c-command) between related linguistic elements, such as a head and its arguments. 
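
The requirement that every dominance link holds in the finished derived tree can be illustrated with a minimal Python sketch. The tree encoding (a child-to-parent dictionary), the node names, and the helper names `dominates` and `links_hold` are assumptions made for illustration only; they are not part of the V-TAG definition, and dominance is taken here to be proper dominance.

```python
def dominates(parent, upper, lower):
    """Return True if `upper` properly dominates `lower` in the tree.

    `parent` maps each node to its parent; the root maps to None.
    """
    node = parent.get(lower)
    while node is not None:
        if node == upper:
            return True
        node = parent.get(node)
    return False


def links_hold(parent, links):
    """Check that every (upper, lower) dominance link holds in the tree."""
    return all(dominates(parent, upper, lower) for upper, lower in links)


# Hypothetical derived tree: S > VP1 > VP2 > V, with an NP directly under VP1.
parent = {"S": None, "VP1": "S", "NP": "VP1", "VP2": "VP1", "V": "VP2"}
print(links_hold(parent, [("VP1", "V")]))   # V lies below VP1
print(links_hold(parent, [("VP2", "NP")]))  # NP does not lie below VP2
```

A derivation whose final tree fails this check for any link would be rejected, whatever order the trees of a set were adjoined in.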
Figure 1: Initial tree set for versuchen matrix clause and geben embedded clause



For illustrative purposes, we give a V-TAG derivation for sentence (1). The grammar of German is the set of tree 
sets. Each tree set contains a head (e.g., a verb) and its projections, and slots for its arguments. Two examples are shown 
in Figure 1. In the set for the geben 'to give' embedded clause, one nominal argument is in a separate auxiliary tree,
reflecting the fact that it may be scrambled, and the other nominal argument is included in the verbal projection tree, 
reflecting the fact that it is (long-distance) topicalized. The dotted line represents the dominance link. In the set for the 
versuchen 'to try' matrix clause, the only nominal argument is in a separate auxiliary tree. Its clausal subcategorization 
requirement is indicated by the fact that the verb is in an auxiliary tree (rooted in VP), forcing adjunction into an embedded 
clause. 
Figure 2: After adjoining matrix clause into subordinate clause (left) and final derived tree (right)



The derivation now proceeds by first adjoining the matrix clause into the embedded clause at the VP node, yielding 
the structure on the left in Figure 2. This adjunction implements the long-distance topicalization of the embedded direct 
argument. We are left with two auxiliary trees that still need to be adjoined, representing the scrambled arguments. We 
first adjoin the matrix subject into its own clause, and then adjoin the embedded indirect object just above the matrix 
subject. The result is shown in Figure 2 on the right. 

Observe that the tree sets given in Figure 1 have the property that they each represent a verb. In linguistic applications 
of TAG and related formalisms such as V-TAG, it is useful to associate each elementary structure (tree set in the case 
of V-TAG) with at least one lexical item. Such a grammar is called "lexicalized". This has an important consequence, 
namely that derivations in a lexicalized grammar are always bounded in length by a linear function of the length of the
derived sentence. In the following discussion of a parser for V-TAG, we will make crucial use of this property.

3 Parsing V-TAG 

In this section, we use an extension of the CYK-type parser for TAG defined by Vijay-Shanker (1987, p. 110) to give a 
polynomial time parser for a large subset of the V-TAG languages. We first describe Vijay-Shanker's parser for simple 
TAG, and then describe the extensions necessary for V-TAG. 

The main idea of Vijay-Shanker's parser is the introduction of a 4-dimensional matrix $T$, in which an entry of a node
$\eta$ from an elementary tree $\tau$ at $T[i, j, k, l]$ represents the fact that either

(i) there is some derived tree $\tau'$ such that $\eta$ is its root node and $\eta$ dominates the substring
$a_{i+1} \cdots a_j \, \eta_1 \, a_{k+1} \cdots a_l$, where $\eta_1$ is the (label of the) foot node of $\tau$, or

(ii) there is some derived tree $\tau'$ such that $\eta$ is its root node and $\eta$ dominates the substring $a_{i+1} \cdots a_l$ and $j = k$.

We split every node into a top and a bottom version, similar to the definition of "top" and "bottom" features in a 
feature-based TAG (Vijay-Shanker, 1987). If $\eta$ is a node in some tree of some set of a V-TAG, then $\eta^\top$ denotes the top
version of that node, and $\eta^\bot$ the bottom version. The parser fills the matrix $T$ bottom-up, starting from entries for the
leaves. (We assume that the grammar is in extended two form, i.e., in every tree every node has at most two children.)
There are six cases which fall into two basic categories²:

(i) Cases 1 to 4 correspond to bottom-up context-free expansions within one elementary tree. Figure 3 shows Case 1.
(ii) Cases 5a, 5b, and 6 deal with adjunction. Cases 5a and 5b correspond to adjunction (either at a node which dominates 
the foot node (5a) or not (5b)). The top version of the node is added to the matrix to reflect the string covered after 
adjunction at that node has taken place, as illustrated in Figure 3 for Case 5a. Case 6 corresponds to no adjunction: the 
top version of a node is added if the bottom version is already present in the same cell of the matrix. 
Figure 3: Cases 1 and 5a.




We now turn to the extensions necessary to handle V-TAG. We first introduce some additional terminology. If two 
nodes $\eta_1$ and $\eta_2$ are linked by a dominance link such that $\eta_1$ dominates $\eta_2$, then we will say that $\eta_1$ has a passive dominance
requirement and that $\eta_2$ has an active dominance requirement. If the tree of which $\eta_1$ ($\eta_2$) is a node has been adjoined
during a derivation, but the tree of which $\eta_2$ ($\eta_1$) is a node has not, the dominance requirement (passive or active) will be
called unfulfilled. The multiset of unfulfilled active dominance requirements of a node $\eta$ will be denoted by $\top(\eta)$, and
the multiset of all passive dominance requirements will be denoted by $\bot(\eta)$. We extend this notation to derived trees.
Let $\tau$ be a derived tree at any intermediate step of a derivation. We associate with $\tau$ multisets which represent all the
unfulfilled active and passive dominance requirements of nodes in $\tau$, written $\top(\tau)$ and $\bot(\tau)$, respectively. Observe that a
(partial) derived initial tree (i.e., a tree without a foot node on its frontier) cannot have any unfulfilled passive dominance
requirements if it is to be part of a successful derivation.

Note that in a lexicalized V-TAG, in every derivation $|\top(\tau)|$ and $|\bot(\tau)|$ are always bounded linearly in the length
of the input string.

In order to keep track of unfulfilled dominance requirements, we add to each entry in the matrix two link-counters 
which record the number and type of active and passive dominance requirements, respectively, which still need to be 
satisfied.³ A link-counter $\gamma$ is an array whose elements are indexed on the dominance links of $G$, and whose values are
integers. The sum of two counters is defined component-wise; the norm $|\gamma|$ is defined as the sum of all components.
We will denote by $\gamma^\top$ the active requirement counter, by $\gamma^\bot$ the passive requirement counter, and by $\mathbf{0}$ the counter all of
whose values are 0.
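
As an illustration of the link-counter just defined, the following Python sketch models a counter as a `collections.Counter` keyed by dominance-link names. The link names `l1` and `l2` and the helper names `add` and `norm` are hypothetical; the text only fixes the component-wise sum, the norm, and the zero counter.

```python
from collections import Counter

ZERO = Counter()  # the counter all of whose values are 0


def add(c1, c2):
    """Component-wise sum of two link-counters."""
    total = Counter(c1)
    total.update(c2)  # Counter.update adds counts instead of replacing them
    return total


def norm(c):
    """Norm of a link-counter: the sum of all its components."""
    return sum(c.values())


# Hypothetical dominance links of a grammar G, named l1 and l2.
active = Counter({"l1": 2})
more_active = Counter({"l1": 1, "l2": 1})
combined = add(active, more_active)
print(dict(combined), norm(combined))
```

Keeping only the nonzero components, as `Counter` does, is one possible design choice; a fixed-length tuple indexed by link position would serve equally well.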



² It is clear how to restrict these cases to implement the adjunction constraints (i.e., obligatory, selective and null adjoining).
³ This approach is based on a related technique used in (Satta, 1993).



We now spell out what happens to the link-counters in the six cases of the parser. In the following, $a \dot{-} b$ is defined
to be $a - b$ if $a > b$, and $0$ otherwise; on link-counters it is applied component-wise.

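Case 1: $\eta_1$ dominates the foot node (see Figure 3). If there is $(\eta_1^\top, \gamma_1^\bot, \gamma_1^\top) \in T[i, j, k, m]$ and $(\eta_2^\top, \mathbf{0}, \gamma_2^\top) \in T[m, p, p, l]$,
$k \le m \le p \le l$, then add $(\eta^\bot, \gamma_1^\bot, \gamma_1^\top + \gamma_2^\top + \top(\eta))$ to $T[i, j, k, l]$. Here $\eta$ is the parent of $\eta_1$ and $\eta_2$.

Cases 2 to 4: are similar to Case 1.

Case 5a: $\eta_1$ dominates the foot node. If there is $(\eta_1^\bot, \gamma_1^\bot, \gamma_1^\top) \in T[m, j, k, p]$ and $(\eta_2^\top, \gamma_2^\bot, \gamma_2^\top) \in T[i, m, p, l]$, where $\eta_2$
is the root node of an auxiliary tree with the same symbol as $\eta_1$, then add
$(\eta_1^\top, (\gamma_2^\bot \dot{-} \gamma_1^\top) + \gamma_1^\bot, (\gamma_1^\top \dot{-} \gamma_2^\bot) + \gamma_2^\top)$ to $T[i, j, k, l]$.

Case 5b: $\eta_1$ does not dominate the foot node. As 5a, except that then $\gamma_1^\bot = \mathbf{0}$, and the move is only valid if $\gamma_2^\bot \le \gamma_1^\top$.

Case 6: No adjoining takes place at node $\eta$. If there is $(\eta^\bot, \gamma^\bot, \gamma^\top) \in T[i, j, k, l]$, then add $(\eta^\top, \gamma^\bot, \gamma^\top)$ to $T[i, j, k, l]$.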
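
The counter arithmetic of the two adjunction cases can be sketched in Python as follows. This is a hedged illustration of the update as read from Cases 5a and 5b above, not the authors' implementation: counters are plain dictionaries keyed by hypothetical link names, and the function names `monus`, `plus`, `leq`, and `adjoin` are invented for the example.

```python
def monus(a, b):
    """Component-wise truncated subtraction: max(a - b, 0) per link."""
    return {k: max(a.get(k, 0) - b.get(k, 0), 0) for k in set(a) | set(b)}


def plus(a, b):
    """Component-wise sum of two link-counters."""
    return {k: a.get(k, 0) + b.get(k, 0) for k in set(a) | set(b)}


def leq(a, b):
    """True if a <= b holds in every component."""
    return all(v <= b.get(k, 0) for k, v in a.items())


def adjoin(site_passive, site_active, aux_passive, aux_active, site_has_foot):
    """Counters for the top version of the adjunction site, or None if blocked.

    site_* belong to the bottom version of the node adjoined at;
    aux_* belong to the root of the auxiliary tree being adjoined.
    """
    if not site_has_foot and not leq(aux_passive, site_active):
        return None  # Case 5b: leftover passive requirements could never be met
    # Passive requirements of the auxiliary tree are cancelled against the
    # active requirements found below the adjunction site, and vice versa.
    new_passive = plus(monus(aux_passive, site_active), site_passive)
    new_active = plus(monus(site_active, aux_passive), aux_active)
    return new_passive, new_active


# Hypothetical links l1 and l2: the site offers l1, the auxiliary tree needs l1.
print(adjoin({}, {"l1": 1}, {"l1": 1}, {"l2": 1}, site_has_foot=False))
print(adjoin({}, {}, {"l1": 1}, {}, site_has_foot=False))
```

In the first call the requirement on `l1` is cancelled and only the new active requirement on `l2` remains; in the second call the adjunction is rejected because nothing below the site can satisfy `l1`.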

In all six cases, after calculating the new $\gamma^\bot$ and $\gamma^\top$, the entry is discarded if $|\gamma^\bot + \gamma^\top| > c \cdot n$, where $c$ is the
maximal number of links in a tree set of the grammar. The recognition of a string $a_1 \cdots a_n$ is successful if for some $j$,
$0 \le j \le n$, and some $\eta$, a root node of an initial tree, we have $(\eta^\top, \mathbf{0}, \mathbf{0}) \in T[0, j, j, n]$.

Finally, we can present the algorithm: 

Input: $a_1 \cdots a_n$, $n > 0$
Output: ACCEPT/REJECT

FOR EVERY $i \in \{0..n-1\}$ "Initialize with leaves"

IF A LEAF-NODE $\eta$ OF AN ELEMENTARY TREE IS LABELED $a_{i+1}$ THEN PUT $(\eta^\top, \mathbf{0}, \top(\eta))$ IN $T[i, i+1, i+1, i+1]$

FOR EVERY $i, j \in \{0..n-1\}$, $i \le j$ "Initialize with foot nodes"

FOR EVERY AUXILIARY TREE (WITH FOOT NODE $\eta$): PUT $(\eta^\top, \bot(\eta), \top(\eta))$ IN $T[i, i, j, j]$

REPEAT FOR EVERY $i, j, k, l \in \{0..n\}$, $i \le j \le k \le l$ "parse bottom-up"

DO Case 1, 2, 3, 4, 5a, 5b, 6 "add a new entry"

UNTIL $T$ UNCHANGED

ACCEPT IF $(\eta^\top, \mathbf{0}, \mathbf{0}) \in T[0, j, j, n]$, $0 \le j \le n$ AND $\eta$ IS ROOT OF SOME INITIAL TREE
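
The control structure of the algorithm (repeat until the matrix is unchanged, discard entries over the bound, then test for acceptance) can be sketched in Python. Only Case 6 is implemented here; the chart layout, the `(name, side)` node encoding, the tuple-valued counters, and the function names are all assumptions made for this illustration, and a full recognizer would supply the remaining cases as further rule functions.

```python
def close_chart(chart, rules, limit):
    """Apply `rules` until no new entry can be added (UNTIL T UNCHANGED).

    chart: dict mapping (i, j, k, l) to a set of (node, passive, active)
           entries, where the two counters are tuples of integers.
    rules: functions taking the chart and yielding (cell, entry) pairs.
    limit: the bound c * n; entries whose counters sum above it are dropped.
    """
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # Materialise the proposals first so the chart is not
            # modified while a rule is still iterating over it.
            for cell, entry in list(rule(chart)):
                _, passive, active = entry
                if sum(passive) + sum(active) > limit:
                    continue  # discard: too many open requirements
                if entry not in chart.setdefault(cell, set()):
                    chart[cell].add(entry)
                    changed = True
    return chart


def case6(chart):
    """No adjunction: copy every bottom entry to the top version of its node."""
    for cell, entries in chart.items():
        for (name, side), passive, active in entries:
            if side == "bottom":
                yield cell, ((name, "top"), passive, active)


def accepted(chart, n, initial_roots):
    """ACCEPT if a top root entry with zero counters spans the whole input."""
    for j in range(n + 1):
        for (name, side), passive, active in chart.get((0, j, j, n), ()):
            if (side == "top" and name in initial_roots
                    and not any(passive) and not any(active)):
                return True
    return False


# Hypothetical one-word input: a bottom entry for root S over the whole string.
chart = {(0, 1, 1, 1): {(("S", "bottom"), (0,), (0,))}}
chart = close_chart(chart, [case6], limit=1)
print(accepted(chart, 1, {"S"}))
```

Because entries are only ever added and the bound keeps the set of admissible counters finite, the loop terminates.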

Theorem: A lexicalized V-TAG is parsable in deterministic polynomial time. 

The correctness of the recognition algorithm for TAG is proven by Vijay-Shanker (1987). It can easily be seen by 
induction on the number of dominance links that the link-counters correctly impose the dominance constraints. 

The time complexity of the algorithm is that of Vijay-Shanker's algorithm, $O(n^6)$, multiplied by a factor representing
the cube of the maximal number of elements of each cell of matrix $T$. Since $|\gamma| \le c \cdot n$, we have that the number of possible
link-counters is bounded by $O(n^{|L|})$ (where $|L|$ is the total number of links in $G$), and the time complexity of the
algorithm is in $O(|G|^3 n^{3|L|} n^6)$, which is polynomial in $n$.
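
The polynomial bound on the number of link-counters can be checked on a small instance. The Python sketch below counts, by brute force, the non-negative integer vectors with one component per link and norm at most $c \cdot n$, and compares the count with the closed form $\binom{cn + |L|}{|L|}$ and with the cruder bound $(cn + 1)^{|L|}$. The concrete values of `num_links`, `c`, and `n` are arbitrary choices for illustration.

```python
from itertools import product
from math import comb


def count_counters(num_links, bound):
    """Count integer vectors of length num_links with entries >= 0 and sum <= bound."""
    return sum(
        1
        for v in product(range(bound + 1), repeat=num_links)
        if sum(v) <= bound
    )


# Hypothetical grammar with 3 dominance links, c = 1, and input length 4.
num_links, c, n = 3, 1, 4
bound = c * n
exact = count_counters(num_links, bound)
print(exact, comb(bound + num_links, num_links), (bound + 1) ** num_links)
```

For a fixed grammar both $c$ and $|L|$ are constants, so the count grows as a polynomial of degree $|L|$ in $n$.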

Using back pointers (e.g., for every $(\eta, \gamma)$ which is added to $T$, pointers to the contributing nodes $\eta_1$ and $\eta_2$ in their
respective positions are added), the matrix T can be augmented to represent a parse forest from which all derivations of 
an accepted string can be constructed. 